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Abstract. EM algorithm is a convenient tool for maximum likelihood 
model fitting when the data are incomplete or when there are latent 
variables or hidden states. In this review article we explain that EM 
algorithm is a natural computational scheme for learning image tem- 
plates of object categories where the learning is not fully supervised. 
We represent an image template by an active basis model, which is a 
linear composition of a selected set of localized, elongated and oriented 
wavelet elements that are allowed to slightly perturb their locations 
and orientations to account for the deformations of object shapes. The 
model can be easily learned when the objects in the training images 
are of the same pose, and appear at the same location and scale. This 
is often called supervised learning. In the situation where the objects 
may appear at different unknown locations, orientations and scales in 
the training images, we have to incorporate the unknown locations, 
orientations and scales as latent variables into the image generation 
process, and learn the template by EM-type algorithms. The E-step im- 
putes the unknown locations, orientations and scales based on the cur- 
rently learned template. This step can be considered self-supervision, 
which involves using the current template to recognize the objects in 
the training images. The M-step then relearns the template based on 
the imputed locations, orientations and scales, and this is essentially 
the same as supervised learning. So the EM learning process iterates 
between recognition and supervised learning. We illustrate this scheme 
by several experiments. 

Key words and phrases: Generative models, object recognition, wavelet 
sparse coding. 
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1. INTRODUCTION: EM LEARNING SCHEME 

The EM algorithm [7] and its variations [14] have 
been widely used for maximum likelihood estimation 
of statistical models when the data are incompletely 
observed or when there are latent variables or hid- 
den states. This algorithm is an iterative computa- 
tional scheme, where the E-step imputes the miss- 
ing data or the latent variables given the currently 
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estimated model, and the M-step re-estimates the 
model given the imputed missing data or latent vari- 
ables. Besides its simplicity and stability, a key fea- 
ture that distinguishes the EM algorithm from other 
numerical methods is its interpretability: both the 
E-step and the M-step readily admit natural inter- 
pretations in a variety of contexts. This makes the 
EM algorithm rich, meaningful and inspiring. 

In this review article we shall focus on one impor- 
tant context where the EM algorithm is useful and 
meaningful, that is, learning patterns from signals 
in the settings that are not fully supervised. In this 
context, the E-step can be interpreted as carrying 
out the recognition task using the currently learned 
model of the pattern. The M-step can be interpreted 
as relearning the pattern in the supervised setting, 
which can often be easily accomplished. 

This EM learning scheme has been used in both 
speech and vision. In speech recognition, the train- 
ing of the hidden Markov model [16] involves the 
imputation of the hidden states in the E-step by 
the forward and backward algorithms. The M-step 
computes the transition and emission frequencies. In 
computer vision, we want to learn models for differ- 
ent categories of objects, such as horses, birds, bikes, 
etc. The learning is often easy when the objects in 
the training images are aligned, in the sense that the 
objects appear at the same pose, same location and 
same scale in the training images, which are defined 
on a common image lattice that is the bounding box 
of the objects. This is often called supervised learn- 
ing. However, it is often the case that the objects 
appear at different unknown locations, orientations 
and scales in the training images. In such a situa- 
tion, we have to incorporate the unknown locations, 
orientations and scales as latent variables in the im- 
age generation process, and use the EM algorithm to 
learn the model for the objects. In the EM learning 
process, the E-step imputes the unknown location, 
orientation and scale of the object in each training 
image, based on the currently learned model. This 
step uses the current model to recognize the object 
in each training image, that is, where it is, at what 
orientation and scale. The imputation of the latent 
variables enables us to align the training images, so 
that the objects appear at the same location, ori- 
entation and scale. The M-step then relearns the 
model from the aligned images by carrying out su- 
pervised learning. So the EM learning process it- 
erates between recognition and supervised learning. 
Recognition is the goal of learning the model, and 



it serves as the self-supervision step of the learning 
process. The EM algorithm has been used by Fergus, 
Perona and Zisserman [9] in training the constella- 
tion model for objects. 

In this article we shall illustrate EM learning or 
EM-like learning by training an active basis model [22, 
23] that we have recently developed for deformable 
templates [1, 24] of object shapes. In this model, a 
template is represented by a linear composition of a 
set of localized, elongated and oriented wavelet ele- 
ments at selected locations, scales and orientations, 
and these wavelet elements are allowed to slightly 
perturb their locations and orientations to account 
for the shape deformations of the objects. In the 
supervised setting, the active basis model can be 
learned by a shared sketch algorithm, which selects 
the wavelet elements sequentially. When a wavelet 
element is selected, it is shared by all the training 
images, in the sense that a perturbed version of this 
element seeks to sketch a local edge segment in each 
training image. In the situations where learning is 
not fully supervised, the learning of the active ba- 
sis model can be accomplished by the EM-type al- 
gorithms. The E-step recognizes the object in each 
training image by matching the image with the cur- 
rently learned active basis template. This enables 
us to align the images. The M-step then relearns 
the template from the aligned images by the shared 
sketch algorithm. 

We would like to point out that the EM algo- 
rithm for learning the active basis model is different 
than the traditional EM algorithm, where the model 
structure is fixed and only the parameters need to 
be estimated. In our implementation of the EM al- 
gorithm, the M-step needs to select the wavelet ele- 
ments in addition to estimating the parameters asso- 
ciated with the selected elements. Both the selection 
of the elements and the estimation of the parameters 
are accomplished by maximizing or increasing the 
complete-data log-likelihood. So the EM algorithm 
is used for estimating both the model structure and 
the associated parameters. 

Readers who wish to learn more about the active 
basis model are referred to our recent paper [23], 
which is written for the computer vision community. 
Compared to that paper, this review paper is writ- 
ten for the statistical community. In this paper we 
introduce the active basis model from an algorith- 
mic perspective, starting from the familiar problem 
of variable selection in linear regression. This paper 
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also provides more details about the EM-type algo- 
rithms than [23] . We wish to convey to the statistical 
audience that the problem of vision in general and 
object recognition in particular is essentially a sta- 
tistical problem. We even hope that this article may 
attract some statisticians to work on this interesting 
but challenging problem. 

Section 2 introduces the active basis model for 
representing deformable templates, and describes the 
shared sketch algorithm for supervised learning. Sec- 
tion 3 presents the EM algorithm for learning the 
active basis model in the settings that are not fully 
supervised. Section 4 concludes with a discussion. 

2. ACTIVE BASIS MODEL: AN 
ALGORITHMIC TOUR 

The active basis model is a natural generalization 
of the wavelet regression model. In this section we 
first explain the background and motivation for the 
active basis model. Then we work through a series of 
variable selection algorithms for wavelet regression, 
where the active basis model emerges naturally. 

2.1 From Wavelet Regression to Active Basis 

2.1.1 p>n regression and variable section Wave- 
lets have proven to be immensely useful for signal 
analysis and representation [8]. Various dictionaries 
of wavelets have been designed for different types of 
signals or function spaces [4, 19]. Two key factors 
underlying the successes of wavelets are the spar- 
sity of the representation and the efficiency of the 
analysis. Specifically, a signal can typically be rep- 
resented by a linear superposition of a small num- 
ber of wavelet elements selected from an appropriate 
dictionary. The selection can be accomplished by ef- 
ficient algorithms such as matching pursuit [13] and 
basis pursuit [5]. 

From a linear regression perspective, a signal can 
be considered a response variable, and the wavelet 
elements in the dictionary can be considered the 
predictor variables or regressors. The number of ele- 
ments in a dictionary can often be much greater than 
the dimensionality of the signal, so this is the so- 
called > n" problem. The selection of the wavelet 
elements is the variable selection problem in linear 
regression. The matching pursuit algorithm [13] is 
the forward selection method, and the basis pursuit 
[5] is the lasso method [21]. 




Fig. 1. A collection of Gabor wavelets at different locations, 
orientations and scales. Each Gabor wavelet element is a sine 
or cosine wave multiplied by an elongated and oriented Gaus- 
sian function. The wave propagates along the shorter axis of 
the Gaussian function. 

2.1.2 Gabor wavelets and simple VI cells Inter- 
estingly, wavelet sparse coding also appears to be 
employed by the biological visual system for repre- 
senting natural images. By assuming the sparsity of 
the linear representation, Olshausen and Field [15] 
were able to learn from natural images a dictionary 
of localized, elongated, and oriented basis functions 
that resemble the Gabor wavelets. Similar wavelets 
were also obtained by independent component anal- 
ysis of natural images [2]. From a linear regression 
perspective, Olshausen and Field essentially asked 
the following question: Given a sample of response 
vectors (i.e., natural images), can we find a dictio- 
nary of predictor vectors or regressors (i.e., basis 
functions or basis elements), so that each response 
vector can be represented as a linear combination of 
a small number of regressors selected from the dic- 
tionary? Of course, for different response vectors, 
different sets of regressors may be selected from the 
dictionary. 

Figure 1 displays a collection of Gabor wavelet ele- 
ments at different locations, orientations and scales. 
These are sine and cosine waves multiplied by elon- 
gated and oriented Gaussian functions, where the 
waves propagate along the shorter axes of the Gaus- 
sian functions. Such Gabor wavelets have been pro- 
posed as mathematical models for the receptive fields 
of the simple cells of the primary visual cortex or VI 
[6]. 

The dictionary of all the Gabor wavelet elements 
can be very large, because at each pixel of the image 
domain, there can be many Gabor wavelet elements 
tuned to different scales and orientations. Accord- 
ing to Olshausen and Field [15], the biological vi- 
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Fig. 2. Active basis templates. Each Gabor wavelet element is illustrated by a bar of the same length and at the same location 
and orientation as the corresponding element. The first row displays the training images. The second row displays the templates 
composed of 50 Gabor wavelet elements at a fixed scale, where the first template is the common deformable template, and the 
other templates are deformed templates for coding the corresponding images. The third row displays the templates composed of 
15 Gabor wavelet elements at a scale about twice as large as those in the second row. In the last row, the template is composed 
of wavelet elements at multiple scales, where larger Gabor elements are illustrated by bars of lighter shades. The rest of the 
images are reconstructed by linear superpositions of the wavelet elements of the deformed templates. 



sual system represents a natural image by a linear 
superposition of a small number of Gabor wavelet 
elements selected from such a dictionary. 

2.1.3 From generic classes to specific categories 
Wavelets are designed for generic function classes or 
learned from generic ensembles such as natural im- 
ages, under the generic principle of sparsity. While 
such generality offers enormous scope for the appli- 
cability of wavelets, sparsity alone is clearly inad- 
equate for modeling specific patterns. Recently, we 
have developed an active basis model for images of 
various object classes [22, 23]. The model is a natural 
consequence of seeking a common wavelet represen- 
tation simultaneously for multiple training images 
from the same object category. 

The active basis can be learned by the shared 
sketch algorithm that we have recently developed 
[22, 23]. This algorithm can be considered a paral- 
leled version of the matching pursuit algorithm [13]. 
It can also be considered a modification of the pro- 
jection pursuit algorithm [10]. The algorithm selects 
the wavelet elements sequentially from the dictio- 
nary. Each time when an element is selected, it is 
shared by all the training images in the sense that 
a perturbed version of this element is included in 
the linear representation of each image. Figure 3 il- 
lustrates the shared sketch process for obtaining the 
templates displayed in the second and third rows of 
Figure 2. 



Figure 2 illustrates the basic idea. In the first row 
there are 8 images of deer. The images are of the 
same size of 122 x 120 pixels. The deer appear at the 
same location, scale and pose in these images. For 
these very similar images, we want to seek a com- 
mon wavelet representation, instead of coding each 
image individually. Specifically, we want these im- 
ages to be represented by similar sets of wavelet ele- 
ments, with similar coefficients. We can achieve this 
by selecting a common set of wavelet elements, while 
allowing these wavelet elements to locally perturb 
their locations and orientations before they are lin- 
early combined to code each individual image. The 
perturbations are introduced to account for shape 
deformations in the deer. The linear basis formed 
by such perturbable wavelet elements is called an 
active basis. 

This is illustrated by the second and third rows of 
Figure 2. In each row the first plot displays the com- 
mon set of Gabor wavelet elements selected from a 
dictionary. The dictionary consists of Gabor wavelets 
at all the locations and orientations, but at a fixed 
scale. Each Gabor wavelet element is symbolically 
illustrated by a bar at the same location and orien- 
tation and with the same length as the correspond- 
ing Gabor wavelet. So the active basis formed by 
the selected Gabor wavelet elements can be inter- 
preted as a template, as if each element is a stroke 
for sketching the template. The templates in the sec- 
ond and third rows are learned using dictionaries of 
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Fig. 3. Shared sketch process for learning the active basis templates at two different scales. 



Gabor wavelets at two different scales, with the scale 
of the third row about twice as large as the scale of 
the second row. The number of Gabor wavelet ele- 
ments of the template in the second row is 50, while 
the number of elements of the template in the third 
row is 15. Currently, we treat this number tun- 
ing parameter, although they can be determined in 
a more principled way. 

Within each of the second and third rows, and for 
each training image, we plot the Gabor wavelet ele- 
ments that are actually used to represent the corre- 
sponding image. These elements are perturbed ver- 
sions of the corresponding elements in the first col- 
umn. So the templates in the first column are de- 
formable templates, and the templates in the re- 
maining columns are deformed templates. Thus, the 
goal of seeking a common wavelet representation for 
images from the same object category leads us to 
formulate the active basis, which is a deformable 
template for the images from the object category. 

In the last row of Figure 2, the common template 
is learned by selecting from a dictionary that con- 
sists of Gabor wavelet elements at multiple scales 
instead of a fixed scale. In addition to Gabor wavelet 
elements, we also include the center-surround differ- 
ence of Gaussian wavelet elements in the dictionary. 
Such isotropic wavelet elements are of large scales, 
and they mainly capture the regional contrasts in 
the images. In the template in the last row, the 
number of selected wavelet elements is 50. Larger 
Gabor wavelet elements are illustrated by bars of 
lighter shades. The difference of Gaussian elements 
are illustrated by circles. The remaining images are 
reconstructed by such multi-scale wavelet represen- 
tations, where each image is a linear superposition 
of the Gabor and difference of Gaussian wavelet el- 
ements of the corresponding deformed templates. 



While selecting the wavelet elements of the active 
basis, we also estimate the distributions of their co- 
efficients from the training images. This gives us a 
statistical model for the images. After learning this 
model, we can then use it to recognize the same type 
of objects in testing images. See Figure 4 for an ex- 
ample. The image on the left is the observed testing 
image. We scan the learned template of deer over 
this image, and at each location, we match the tem- 
plate to the image by deforming the learned tem- 
plate. The template matching is scored by the log- 
likelihood of the statistical model. We also scan the 
template over multiple resolutions of the image to 
account for the unknown scale of the object in the 
image. Then we choose the resolution and location 
of the image with the maximum likelihood score, 
and superpose on the image the deformed template 
matched to the image, as shown by the image on the 
right in Figure 4. This process can be accomplished 
by a cortex-like architecture of sum maps and max 
maps, to be described in Section 2.11. In machine 
learning and computer vision literature, detecting or 
classifying objects using the learned model is often 
called inference. The inference algorithm is often a 
part of the learning algorithm. For the active basis 
model, both learning and inference can be formu- 
lated as maximum likelihood estimation problems. 



Fig. 4. Left: Testing image. Right: Object is detected and 
sketched by the deformed template. 
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2.1.4 Local maximum pooling and complex VI cells 
Besides wavelet sparse coding theory for VI simple 
cells, another inspiration to the active basis model 
also comes from neuroscience. Riesenhuber and Pog- 
gio [17] observed that the complex cells of the pri- 
mary visual cortex or VI appear to perform local 
maximum pooling of the responses from simple cells. 
From the perspective of the active basis model, this 
corresponds to estimating the perturbations of the 
wavelet elements of the active basis template, so 
that the template is deformed to match the observed 
image. Therefore, if we are to believe Olshausen 
and Field's theory on wavelet sparse coding [15] and 
Riesenhuber and Poggio's theory on local maximum 
pooling, then the active basis model seems to be a 
very natural logical consequence. 

In the following subsections we shall describe in 
detail wavelet sparse coding, the active basis model, 
and the learning and inference algorithms. 

2.2 An Overcomplete Dictionary of Gabor 
Wavelets 

The Gabor wavelets are translated, rotated and 
dilated versions of the following function: 

G(xi,X2) ocexp{-[(xi/ai)2 + (x2M)']/2}e'"S 

which is sine-cosine wave multiplied by a Gaussian 
function. The Gaussian function is elongated along 
the X2-axis, with a2 > and the sine-cosine wave 
propagates along the shorter xi-axis. We truncate 
the function to make it locally supported on a finite 
rectangular domain, so that it has a well defined 
length and width. 

We then translate, rotate and dilate G{xi,X2) to 
obtain a general form of the Gabor wavelets: 

-Bxi,x2,s,a(3;'i,a;2) = G(xi/s,X2/s)/s^, 

where 

xi = {x'l — xi) cos Q + {x'2 — X2) sin a, 

£2 = — {x'l — xi) sin a + {x'2 — X2) cos a. 

Writing x = {xi,X2)-, each Bx,s,a is a localized func- 
tion, where x = {xi,X2) is the central location, s 
is the scale parameter, and a is the orientation. 
The frequency of the wave propagation in Bx^s,a is 

UJ = l/s. Bx,s,a = {Bx,s,a,0, -Bx,s,a,l), where Bx,s,a,0 

is the even-symmetric Gabor cosine component, and 
Bx,s,a,i is the odd-symmetric Gabor sine compo- 
nent. We always use Gabor wavelets as pairs of co- 
sine and sine components. We normalize both the 



Gabor sine and cosine components to have zero mean 
and unit £2 norm. For each Bx^s,a, the pair Bx^s,a,o 
and Bx^s,a,i are orthogonal to each other. 
The dictionary of Gabor wavelets is 

^ = {Bx,s,a,y{x,s,a)}. 

We can discretize the orientation so that a G {ovr/O, 
o = 0,...,O — 1}, that is, O equally spaced orienta- 
tions (the default value of O is 15 in our experi- 
ments). In this article we mostly learn the active 
basis template at a fixed scale s. The dictionary 
Q is called "overcomplete" because the number of 
wavelet elements in $7 is larger than the number of 
pixels in the image domain, since at each pixel, there 
can be many wavelet elements tuned to different ori- 
entations and scales. 

For an image i{x), with x € D, where D is a set of 
pixels, such as a rectangular grid, we can project it 
onto a Gabor wavelet Bx^s,a,r], = 0, 1. The projec- 
tion of I onto Bx^s,a,r], or the Gabor filter response 
at {x, s, a), is 

{^j Bx^s,a,ri) — ^ ^ -^("^ )Bx^s,a,ri{x )■ 
x' 

The summation is over the finite support of Bx^s,a,ri- 
We write {I,Bx,s,a) = {{I, Bx,s,a,o), Bx,s,a,i))- The 
local energy is 

Kl)-B^,s,a)P = (I,-Ba;,s,a,o)^ + C^, Bx^s,a,l)'^ ■ 

Bx^s,a)\'^ is the local spectrum or the magnitude 
of the local wave in image I at {x, s, a). 
Let 

' ' a xGD 

where \D\ is the number of pixels in I, and O is the 
total number of orientations. For each image I, we 
normalize it to I I/a^ , so that different images are 
comparable. 

2.3 Matching Pursuit Algorithm 

For an image I(x) where x ^ D, we seek to repre- 
sent it by 

n 

(1) l = Y,C^Bx^,s,o.^ + U, 

i=l 

where {Bxi^s,ai,i = 1, ■ ■ ■ ,n) C 0, is a set of Gabor 
wavelet elements selected from the dictionary 0, 
Cj is the coefficient, and U is the unexplained resid- 
ual image. Recall that each Bxi^s,ai is a pair of 
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Gabor cosine and sine components. So i?Xi,s,Qi = 

(-^^x'i,s,ai,0) -^x'i,s,ai,l) ) Q — (Ci,0)Q,l)) and CiBx^^s.Oi — 

Ci,oBxi,s,ai,o + Ci^iBxi,s,ai,i- We fix the scale param- 
eter s. 

In the representation (1), n is often assumed to 
be small, for example, n = 50. So the representation 
(1) is called sparse representation or sparse cod- 
ing. This representation translates a raw intensity 
image with a huge number of pixels into a sketch 
with only a small number of strokes represented 
by B = {Bxi^s,ai,i = 1, . . . , n). Because of the spar- 
sity, B captures the most visually meaningful el- 
ements in the image. The set of wavelet elements 
i = 1, . . . , n) can be selected from Q by 
the matching pursuit algorithm [13], which seeks to 



minimize 



|I-E"=iCi-Bx, 



by a greedy scheme. 



Algorithm (Matching pursuit algorithm). 

0. Initialize i 0, [/ ^ I. 

1. Let i i + 1. Let (xj,aj) = argmax^j^Q, |({7, 

Bx,s,a)\'^- 

2. Let Ci = {U, Bxi^s,ai)- Update U U — Ci x 

3. Stop if i = n, else go back to 1. 

In the above algorithm, it is possible that a wavelet 
element is selected more than once, but this is ex- 
tremely rare for real images. As to the choice of n 
or the stopping criterion, we can stop the algorithm 
if \ci\ is below a threshold. 

Readers who are familiar with the so-called "large 
p and small n" problem in linear regression may have 
recognized that wavelet sparse coding is a special 
case of this problem, where I is the response vec- 
tor, and each Bx^s,a G is a predictor vector. The 
matching pursuit algorithm is actually the forward 
selection procedure for variable selection. 

The forward selection algorithm in general can be 
too greedy. But for image representation, each Ga- 
bor wavelet element only explains away a small part 
of the image data, and we usually pursue the ele- 
ments at a fixed scale, so such a forward selection 
procedure is not very greedy in this context. 

2.4 Matching Pursuit for Multiple Images 

Let {Im,m = 1, . . . ,M} be a set of training im- 
ages defined on a common rectangle lattice D, and 
let us suppose that these images come from the same 
object category, where the objects appear at the 
same pose, location and scale in these images. We 



can model these images by a common set of Gabor 
wavelet elements, 

n 

(2) Im = '^Cm,iBx^,s,a, + Um, m = l,...,M. 



i=l 



Xi,s,ai 1 I 



1, . . . ,n) can be considered a com- 



B={B: 

mon template for these training images. Model (2) 
is an extension of model (1). 

We can select these elements by applying the match- 
ing pursuit algorithm on these multiple images si- 
multaneously. The algorithm seeks to minimize 
Ylm=i Urn " Y17=i Cm,iBx„s,a, |P by a greedy scheme. 

Algorithm 1 (Matching pursuit on multiple im- 
ages) . 

0. Initialize z 0. For m = 1,...,M, initialize 

1. i i + 1. Select 

M 

(xi,Qi) = argmax \{Um, Bx,s,a)\'^ ■ 

x,a ^ — ' 

m=l 

2. For m = l,...,M, let Cm,i = {Um, Bx^,s,a,) , and 

update Um Um — Cm,iBxi,s,ai- 

3. Stop \i i = n, else go back to 1. 

Algorithm 1 is similar to Algorithm 0. The differ- 
ence is that, in Step 1, (xj,Qj) is selected by maxi- 
mizing the sum of the squared responses. 

2.5 Active Basis and Local Maximum Pooling 

The objects in the training images share similar 
shapes, but there can still be considerable variations 
in their shapes. In order to account for the shape 
deformations, we introduce the perturbations to the 
common template, and the model becomes 



Im — ^ '] Cm,iBxi+Axm,i,s,ai+Aam.i + ^mi 



(3) 



i=l 



m=l M. 



Again, B = {Bxi,s,a^-,'i = 1, . . . , n) can be considered 
a common template for the training images, but this 
time, this template is deformable. Specifically, for 
each 

turbed to Sx',+Ax',„,,,s,Q,+Aa™,i, where Ax^.i is the 
perturbation in location, and Aa^.i is the perturba- 
tion in orientation. B^ = (-B^.i+Ax„,,,s,a,+AQ„,,, « = 
l,...,n) can be considered the deformed template 
for coding iniagG IjTi. Wg cbII thG basis forniGd by 
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B = (5., 



l,...,re) the active basis, and we 



call {Axm,i, Aa.m,i, i = 1, . . . , n) the activities or per- 
turbations of the basis elements for image m. Model 
(3) is an extension of model (2). 

Figure 2 illustrates three examples of active basis 
templates. In the second and third rows the 
templates in the first column are B = {Bx^^s,ai,'i' = 
1, . . . , n). The scale parameter s in the second row is 
smaller than the s in the third row. For each row, the 
templates in the remain columns are the deformed 
templates = (-B^-i+Ax',„,i,s,ai+Aa„,i, « = 1, •••,«), 
for m = 1, . . . , 8. The template in the last row should 
be more precisely represented by B = (-Bxi,si,Oi;^ = 
1 , . . . , n) , where each element has its own Si auto- 
matically selected together with {xi,ai). In this ar- 
ticle we focus on the situation where we fix s (default 
length of the wavelet element is 17 pixels). 

For the activity or perturbation of a wavelet ele- 
ment -Ba;,s,cn we assume that Ax = (d cos a, d sin a), 
with d E [—61,61]. That is, we allow B^^s^a to shift 
its location along its normal direction. We also as- 
sume Aa € [—62,62]- bi and 62 are the bounds for 
the allowed displacements in location and orienta- 
tion (default values: 61 = 6 pixels, and 62 = vr/lS). 
We define 

^4(0) = {(Ax = (dcosa,dsina), Aa) : 

de [-61,61], Aae [-62,62]} 

the set of all possible activities for a basis element 
tuned to orientation a. 

We can continue to apply the matching pursuit 
algorithm to the multiple training images, the only 
difference is that we add a local maximum pooling 
operation in Steps 1 and 2. The following algorithm 
is a greedy procedure to minimize the least squares 
criterion: 

2 



M 



(4) E 



■m=l 



n 
i=l 



Algorithm 2 (Matching pursuit with local max- 
imum pooling). 

0. Initialize i ^ 0. For m = 1,...,M, initialize 

1. i •(— z -|- 1. Select 



{.Xi,ai) 



M 



arg max 



max |([/m,B^+Ax,s,a+Aa)|^ 
(Ax,Ao)GA(a) 



2. For 771 = 1, ... , M, retrieve 

= arg max \{Um,Bx^+/^x^s,on+/\a)?' ■ 
(Aa;,Aa)eA(Qi) 

Let c,m,j ^ (^7m,-Bxi+Ax„,„s,a,+Aa,„,i), and update 

Urn ^ Um Cnn,iBx^-\-/\xm,i,s,ai+Aam,i' 

3. Stop if i = n, else go back to 1. 

Algorithm 2 is similar to Algorithm 1. The differ- 
ence is that we add an extra local maximization op- 
eration in Step 1: maX(Aa;,Aa)GA{a) \ {Um,Bx+Ax,s,a+Aa 

With {xi,ai) selected in Step 1, Step 2 retrieves the 
corresponding maximal (Ax, Aa) for each image. 

We can rewrite Algorithm 2 by defining Rmix, a) = 
{UmiBx,s,a) ■ Then instead of updating the residual 
image Um in Step 2, we can update the responses 
Rm{x,a). 

Algorithm 2.1 (Matching pursuit with local max- 
imum pooling). 

0. Initialize i •<— 0. For m = 1,...,M, initalize 

Rm{x,a) ^ {Im,Bx^s,a) for ah (x,a). 

1. i-^ i + Select 



(xj, aj) 



A4 



arg max > max \Rm{x + Ax, 

x,a ^-^ (Ax,Aa)GA(a) 
m=l 



2. For m = 1, . . . , M, retrieve 



{AXm,ij A(y,fn,i) 

= arg max |i?m(2;j + Ax, a^ -|- Aa)|^. 

(Ax,Aa)&A(ai) 

Let Cm,i ^ Rm{xi + Axm,i,ai + Aam,i), and update 
Rm{x,a) 

^ Rm iX} Oi) Cm,j {Bx^s.a j Bx^-\-Ax„i^i,s,ai+Aa^^i ) • 

3. Stop if i = n, else go back to 1. 

2.6 Shared Sketch Algorithm 

Finally, we come to the shared sketch algorithm 
that we actually used in the experiments in this pa- 
per. The algorithm involves two modifications to Al- 
gorithm 2.1. 
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Algorithm 3 (Shared sketch algorithm). 

0. Initialize z ^ 0. For m = l,...,M, initialize 

Rm{x,a) ^ {Im,Bx,s,a) for all (x,a). 

1. i i + 1. Select 

(Xj , ttj) 

M 

= argmax\ max h{\R„i{x + Ax, 

x,a ^-^ (Ax,Aa)£A(a) 
m=l 

a + Aa)\^). 

2. For m = 1, . . . , M, retrieve 

{Ax^Yi^i, Ao!jYi,i) 

= arg max \Rm{xi + Ax,ai + Aa)\'^ . 

{Ax,Aa)£A{ai) 

Let Cm,i ^ Rm{xi + Axm,i-, oti + Aam.i), and update 
Rmix, a) ^ if 

COT:T:{Bx,s,a,Bxi+Ax^^i,s,ai+Aa^^i) > 0. 

3. Stop if i = n, else go back to 1. 

The two modifications are as follows: 

(1) In Step 1, we change \Rmix + Ax, a + Aa)p 
to h{\Rm{x + Ax, a + Aa)p) where h{-) is a sigmoid 
function, which increases from to a saturation level 
C (default: ^ = 6), 



Intuitively, 'E^=iTaa^(^^x,Aa)&Aia) H\Rm{x + Ax, 
a + Aa)|^) can be considered the sum of the votes 
from all the images for the location and orientation 
(x, a), where each image contributes max(Aa;,A«)eA(a) 
h{\Rm{x + Ax, a + Aa)\'^) . The sigmoid transforma- 
tion prevents a small number of images from con- 
tributing very large values. As a result, the selection 
of (x,a) is a more "democratic" choice than in Al- 
gorithm 2, and the selected element seeks to sketch 
as many edges in the training images as possible. In 
the next section we shall formally justify the use of 
sigmoid transformation by a statistical model. 

(2) In Step 2, we update Rm{x,a) if Bx,s,a 
is not orthogonal to -B^,+Ax„,„s,ai+Aa^,,- That is, 
we enforce the orthogonality of the basis = 
(5x-,+Ax„,,,s,a,+Aa„,,,i = I, ■ ■ ■ ,n) for each training 
image m. Our experience with matching pursuit is 
that it usually selects elements that have little over- 
lap with each other. So for computational conve- 
nience, we simply enforce that the selected elements 



are orthogonal to each other. For two Gabor wavelets 
Bi and B2, we define their correlation as corr(i?i, S2) = 
E;^l=oEi2=o(-^l.^l'-^2,r,2)^ that is, the sum of 
squared inner products between the sine and co- 
sine components of Bi and B2. In practical imple- 
mentation, we allow small correlations between se- 
lected elements, that is, we update Rm{x,a) 

if coTT{Bx,s,a:Bxi+Ax^^i,s,ai+Aa,r.,i) > ^ (the default 
value of e = 0.1). 

2.7 Statistical Modeling of Images 

In this subsection we develop a statistical model 
for Ijn- A statistical model is not only important for 
justifying Algorithm 3 for learning the active ba- 
sis template, it also enables us to use the learned 
template to recognize the objects in testing images, 
because we can use the log-likelihood to score the 
matching between the learned template and the im- 
age data. 

The statistical model is based on the decomposi- 
tion Im = YlILl Cm,iBxi+Ax^,i,s,ai+Aam,i + Um, where 

= {Bx^+Ax^,i,s,ai+Aa^_,,i = l,...,n) is orthog- 
onal, and Cm,i = {lm,Bxi+Axm,i,s,ai+Aa,„j), SO Um 

lives in the subspace that is orthogonal to B^. In 
order to specify a statistical model for 1^ given B^, 
we only need to specify the distribution of {cm,i,i = 
1, . . . ,n) and the conditional distribution of Um given 

{Cm,i^ i — 1) • • • 5 ^)' 

The least squares criterion (4) that drives Algo- 
rithm 2 implicitly assumes that Um is white noise, 
and Cm,i follows a flat prior distribution. These as- 
sumptions are wrong. There can be occasional strong 
edges in the background, but a white noise Um can- 
not account for strong edges. The distribution of 
Cm,i should be estimated from the training images, 
instead of being assumed to be a flat distribution. 

In this work we choose to estimate the distribu- 
tion of Cm,i from the training images by fitting an 
exponential family model to the sample {cm,i,'m = 
1, . . . , M} obtained from the training images, and we 
assume that the conditional distribution of Um given 
{cm,i-,i = 1, . . . , n) is the same as the corresponding 
conditional distribution in the natural images. Such 
a conditional distribution can account for occasional 
strong edges in the background, and it is the use of 
such a conditional distribution of Um as well as the 
exponential family model for Cm,i that leads to the 
sigmoid transformation in Algorithm 3. Intuitively, a 
large response \Rm{x + Ax , a + Aa)\'^ indicates that 
there can be an edge at (x + Ax, a + Aa). Because 
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an edge can also be accounted for by the distribu- 
tion of Um in the natural images, a large response 
should not be taken at its face value for selecting 
the basis elements. Instead, it should be discounted 
by a transformation such as h{-) in Algorithm 3. 

2.8 Density Substitution and Projection Pursuit 

More specifically, we adopt the density substitu- 
tion scheme of projection pursuit [10] to construct a 
statistical model. We start from a reference distribu- 
tion In this article we assume that q(l) is the 
distribution of all the natural images. We do not 
need to know q(l) explicitly beyond the marginal 
distribution q{c) of c = {I,Bx^s,a) under q(X). Be- 
cause q(l) is stationary and isotropic, q{c) is the 
same for different {x,a). q{c) is a heavy tailed dis- 
tribution because there are edges in the natural im- 
ages. q{c) can be estimated from the natural images 
by pooling a histogram of {(I,^^;^^,^)) VI, V(a;,a)} 
where {1} is a sample of the natural images. 

Given B„ = {B^^+Axm,i,s,ai+Aa^,i,i = 1, • • • we 
modify the reference distribution q{lm) to a new 
distribution p{Im) by changing the distributions of 
Cm,i- Let Pi{c) be the distribution of Cm,i pooled 
from {cm,i,'m = 1, . . . , M}, which are obtained from 
the training images {Im,"^ = 1,...,M}. Then we 
change the distribution of Cm,i from q{c) to Pi(c), for 
each i = 1, . . . , n, while keeping the conditional dis- 
tribution of Um given {cm,i,i = 1, . . . , n) unchanged. 
This leads us to 

-a(I )Y]Pii^l!lA 

where we assume that {cm,i,i = 1, . . . ,n) are inde- 
pendent under both q{Im) and p(Im\^m), for or- 
thogonal Bm. The conditional distributions of Um 
given {cm,i,i = 1, . . . ,n) under p(Im\Bm) and q(Im) 
are canceled out in p{Im\Bm) / Q(i-m) because they 
are the same. The Jacobians are also the same and 
are canceled out. So p{lm\Bm)/qiim) = IYi=iPiicm,i)/ 
q{cm,i)- 

The following are three perspectives to view mod- 
el (6): 

(1) Classification: we may consider q{I) as repre- 
senting the negative examples, and {Im} are posi- 
tive examples. We want to find the basis elements 
iBx„s,a,,i = 1, . . . so that the projections Cm,i = 
{Im, B^^+Ax^^i,s,a,+Aa^J for i = 1, . . . , n distinguish 
the positive examples from the negative examples. 



(2) Hypothesis testing: we may consider q(l) as 
representing the null hypothesis, and the observed 
histograms of Cm,i, i = 1, ■ ■ ■ ,n are the test statistics 
that are used to reject the null hypothesis. 

(3) Coding: we choose to code Cm,i by Pi{c) in- 
stead of q{c), while continuing to code Um by the 
conditional distribution of Um 

given {c-m^i ? ^ — 1 1 • • • i ^) 

under qiV)- 

For all the three perspectives, we need to choose 
Bxi,s,ai so that there is big contrast between pi{c) 
and q{c). The shared sketch process can be con- 
sidered as sequentially flipping dimensions of q(lm) 
from q{c) to Pi{c) to fit the observed images. It is es- 
sentially a projection pursuit procedure, with an ad- 
ditional local maximization step for estimating the 
activities of the basis elements. 

2.9 Exponential Tilting and Saturation 
Transformation 

While Pi{c) can be estimated from {cm,i,m = 1, . . . , 
M} by pooling a histogram, we choose to parametrize 
Pi{c) with a single parameter so that it can be esti- 
mated from even a single image. 

We assume Pi{c) to be the following exponential 
family model: 

(7) p{c; A) = exp{Xh{r)}q{c), 

where A > is the parameter. For c = (co,ci), r = 
1*^1 ~ '^o ~^ '^l' 

Z{X) = j exp{A/i(r)}g(c) dc = Ejexp{A/i(r)}] 

is the normalizing constant. h{r) is a monotone in- 
creasing function. We assume Pi{c) = p(c;Aj), 
which accounts for the fact that the squared re- 
sponses {\Cm,i? = |(Im,^^^i+Ax™,i,s,a,+Aa,„,,)P,"l = 

1,...,M} in the positive examples are in general 
larger than those in the natural images, because 
B^ 

i+^Xm,i,s,ai-ir/\am,i tends to sketch a local edge seg- 
ment in each 1^. As mentioned before, q{c) is es- 
timated by pooling a histogram from the natural 
images. 

We argue that h{r) should be a saturation trans- 
formation in the sense that as r — )■ cx), h{r) approaches 
a finite number. The sigmoid transformation in (5) is 
such a transformation. The reason for such a trans- 
formation is as follows. Let q{r) be the distribution 
of r = |cp = under q{c) where I ~ qil). We 

may implicitly model q{r) as a mixture of Ponii") and 
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Posi'''), where pon is the distribution of r when B is 
on an edge in I, and pos is the distribution of r when 
B is not on an edge in I. Ponif) has a much heavier 
tail thanpos{r). Let q{r) = {I- po)Po{iir)+PoPon{r), 
where po is the proportion of edges in the natu- 
ral images. Similarly, let pi{r) be the distribution 
of r = |cp under Pi{c). We can model Pi{r) = (1 — 
Pi)PoH{r) + PiPonir), where pi > po, that is, the pro- 
portion of edges sketched by the selected basis ele- 
ment is higher than the proportion of edges in the 
natural images. Then, as r — > cxd, Pi{r)/q{r) — )■ Pi/po, 
which is a constant. Therefore, h{r) should saturate 
as r ^ oo. 

2.10 Maximum Likelihood Learning and Pursuit 
Index 

Now we can justify the shared sketch algorithm as 
a greedy scheme for maximizing the log- likelihood. 
With parametrization (7) for the statistical model 
(6), the log-likelihood is 

M 11 



m=l i=l 



(8) 



i=l 
n 

E 



PiiCn 



q{Cm,i) 
M 



m=l 



~M log Z{Xi 



where the mean parameter p{X) of the exponential 
family model is 



(10) 



/x(A) 



1 



Z{X) 



h{r) exp{A/i(r)}g(r) dr. 



We want to estimate the locations and orientations 
of the elements of the active basis , (xj , a j , z = 1 , . . . , n) , 
the activities of these elements, (Axm,j, Aa^.i, ^ = 
1, . . . , n), and the weights, (Aj, i = 1, . . . , n), by max- 
imizing the log-likelihood (8), subject to the con- 
straints that Bm = (5a;,+Ax„.,,s,a,+Aa„,, ,1 = 1,..., n) 

is orthogonal for each m. 

First, we consider the problem of estimating the 
weight Xi given B^- To maximize the log-likelihood 
(8) over Aj, we only need to maximize 

M 

k{Xi) = Xi ^ h{\{Im, Bxi+Ax^,i,s,ai+Aa^^i)\'^) 
m=l 

-M\ogZ{Xi). 

By setting l[{Xi) = 0, we get the well-known form of 
the estimating equation for the exponential family 
model, 

, , /^(Ai) 
(9) 

1 ^ 

77 ^ ^ h{\(}-m-,Bxi+Axm.,i,s,ai+A.am,i)\ )) 



M 



The estimating equation (9) can be solved easily be- 
cause /x(A) is a one-dimensional function. We can 
simply store this monotone function over a one-dimen- 
sional grid. Then we solve this equation by looking 
up the stored values, with the help of nearest neigh- 
bor linear interpolation for the values between the 
grid points. For each grid point of A, p{X) can be 
computed by one-dimensional integration as in (10). 
Thanks to the independence assumption, we only 
need to deal with such one-dimensional functions, 
which relieves us from time consuming MCMC com- 
putations. 

Next let us consider the problem of selecting (xj , Oi) , 
and estimating the activity (Axm,i, Aam,^) for each 
image Im- Let Aj be the solution to the estimat- 
ing equation (9). li{Xi) is monotone in ^m=i ^(K^m; 
Bx,+Axra,i,s,ai+Aam,i)?')- Thcrcforc, we need to find 
(xj,aj), and (Axm,j, Aa^.i)) by maximizing 
Em=i ^(l(Im,5xi+Ax„,„s,a,+Aa,„,.)P)- This justifies 
Step 1 of Algorithm 3, where X^m=i ^{\^m{x + Ax, 
a + Aa)p) serves as the pursuit index. 

2.11 SUM-MAX Maps for Template Matching 

After learning the active basis model, in partic- 
ular, the basis elements B = {Bx^.s^cti^'i = 1, - - - ,n) 
and the weights (Aj,i = l,...,n), we can use the 
learned model to find the object in a testing im- 
age I, as illustrated by Figure 4. The testing im- 
age may not be defined on the same lattice as the 
training images. For example, the testing image may 
be larger than the training images. We assume that 
there is one object in the testing image, but we do 
not know the location of the object in the testing 
image. In order to detect the object, we scan the 
template over the testing image, and at each loca- 
tion X, we can deform the template and match it 
to the image patch around x. This gives us a log- 
likelihood score at each location x. Then we can find 
the maximum likelihood location x that achieves the 
maximum of the log-likelihood score among all the 
X. After computing x, we can then retrieve the ac- 
tivities of the elements of the active basis template 
centered at x. 
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Algorithm 4 (Object detection by template 
matching) . 

1. For every x, compute 

lix) 

n 

5Z r^^A A™^^., ^/l(Kl,5x+x,+Ax,s,ai+Aa)P) 



2=1 



logZ(A,0 



2. Select a; = argmaxj; /(x). For i = l,...,n, re- 
trieve 

= arg max | (I, S£+^.,+Ax,s,«,+Aq) 
(Ax,Aa)eyl(ai) 

3. Return the location x, and the deformed tem- 
plate (5£+Xi+Aa.'i,s,ai+AQi,^ = 1, • • • 

Figure 4 displays the deformed template 
(S£+a.^+Ax„s,a,+Aa,,^ = 1, • • • ,?^), which is Superpo- 
sed on the image on the right. 

Step 1 of the above algorithm can be realized by 
a computational architecture called sum-max maps. 

Algorithm 4.1 (sum-max maps). 

1. For all (x,a), compute SUMl(x,a) = /i(|(I, 

2. For all (x,a), compute 

MAX1(x,q)= max SUMl(x + Ax, a + Aa). 

(Ax,Ao)eA(a) 

3. For all x, compute SUM2(x) = EILJ^* x 
MAXl(x + Xi,ai) -log^(^i)]- 

SUM2(x) is lix) in Algorithm 4. 

The local maximization operation in Step 2 of Al- 
gorithm 4.1 has been hypothesized as the function 
of the complex cells of the primary visual cortex 
[17]. In the context of the active basis model, this 
operation can be justified as the maximum likeli- 
hood estimation of the activities. The shared sketch 
learning algorithm can also be written in terms of 
sum-max maps. 

The activities (Axm,j, Aa^,*, ^ = 1, • • • , n) should 
be treated as latent variables in the active basis 
model. However, in both learning and inference al- 
gorithms, we treat them as unknown parameters, 
and we maximize over them instead of integrating 
them out. According to Little and Rubin [12], max- 
imizing the complete-data likelihood over the latent 



variables may not lead to valid inference in general. 
However, in natural images, there is little noise, and 
the uncertainty in the activities is often very small. 
So maximizing over the latent variables can be con- 
sidered a good approximation to integrating out the 
latent variables. 

3. LEARNING ACTIVE BASIS TEMPLATES 
BY EM-TYPE ALGORITHMS 

The shared sketch algorithm in the previous sec- 
tion requires that the objects in the training images 
{Im} are of the same pose, at the same location and 
scale, and the lattice of Im is the bounding box of 
the object in Im* If is often the case that the objects 
may appear at different unknown locations, orienta- 
tions and scales in {Im}- The unknown locations, 
orientations and scales can be incorporated into the 
image generation process as hidden variables. The 
template can still be learned by the maximum like- 
lihood method. 

3.1 Learning with Unknown Orientations 

We start from a simple example of learning a horse 
template at the side view, where the horses can face 
either to the left or to the right. Figure 5 displays the 
results of EM learning. The three templates in the 
first row are the learned templates in the first three 
iterations of the EM algorithm. The rest of the figure 
displays the training images, and for each training 
image, a deformed template is displayed to the right 
of it. The EM algorithm correctly estimates the di- 
rection for each horse, as can be seen by how the 
algorithm flips the template to sketch each training 
image. 



Let B = (5i = 



i,s,ait ' 



1, 



be the 



deformable template of the horse, and 



B 



h 



, n) be the de- 



formed template for Im. Then Im can either be gen- 
erated by Bm or the mirror reflection of Bm, that is, 



(B 



R{xi+Axm,i),!i,-{ai+Aam,i) ' ' 



l,...,n), where for 
X = (xi,X2), R{x) = (— xi,X2) (we assume that the 
template is centered at origin). We can introduce a 
hidden variable to account for this uncertainty, 
so that Zm = ^ Im is generated by Bm , and Zm = 
if Im is generated by the mirror reflection of Bm- 
More formally, we can define 'Bm{zm), so that 
Bm(l) = Bm, and Bm(0) is the mirror reflection 
of Bm. Then we can assume the following mixture 
model: Zm ~ Bernoulli(p), where p is the prior prob- 
ability that Zjji = 1 , and [Im | Zm ] ~p(Im|Bm(^;m), A), 
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where A = (Aj, i = l, . . . ,n). We need to learn B, and 
estimate A and p. 

A simple observation is that p(Im\^m{zm)) = 
p(Im(^m)|Bm), where 1^(1) = Im and I„,(0) is the 
mirror reflection of !„. In other words, in the case 
of Zm = ^, we do not need to make any change to 
Im or Bm- In the case of Zm = 0, we can either flip 
the template or flip the image, and these two alter- 
natives will produce the same value for the likehood 
function. 

In the EM algorithm, the E-step imputes Zm for 
m= 1,...,M using the current template B. This 
means recognizing the orientation of the object in 
Im- Given Zm, we can change Im to Im{zm), so that 



{Imizm)} become aligned with each other, if Zm are 
imputed correctly. Then in the M-step, we can learn 
the template from the aligned images {Im(-Zm)} by 
the shared sketch algorithm. 

The complete data log-likelihood for the mth ob- 
servation is 

logp(I 

m; Zm 
= Zmlogp(Im|Bm, A) 

+ {I - Zm)logp{ImmBm,A) 
+ Zmlogp+{l - Zm)log{l - p), 



tiril 'Sit ^ 

Bi^iiR^ 

Fig. 5. Template learned from images of horses facing two different directions. The first row displays the templates learned 
in the first 3 iterations of the EM algorithm. For each training image, the deformed template is plotted to the right of it. The 
number of training images is 57. The image size is 120 x 100 (width x height). The number of elements is 40. The number of 
EM iterations is 3. 
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which is Hnear in z^- So in the E-step, we only need 
to compute the predictive expectation of z^, 

Zm = Pr(zm = l|Bm, A,p) 

_ pp{Im\'Bm,A-) 

~ pp{Im\Bm,A) + (1 - p)p(I„(0)|B„, A)' 

Both logp(Im|Bm, A) and logp(Im(0)|Bm, A) are 
readily available in the M-step. 

The M-step seeks to maximize the expectation of 
the complete-data log-likelihood, 



E 

i=l 



M 



m=l 



ill) 



+ (l-z^)/i(|(I,„(0),5™,,)|2)) 
-MlogZ(A, 



(12) 



M 



m=l 



M 



+ log(l-p) M-^ 



m=l 



The maximization of (12) leads to p = Zm/M . 

The maximization of (11) can be accomplished by 
the shared sketch algorithm, that is, Algorithm 3, 
with the following minor modifications: 

(1) The training images become {Im, lm(0), ""^ = 
1, . . . , M}, that is, there are 2M training images in- 
stead of M images. Each 1^ contributes two copies, 
the original copy Im or Im(l), and the mirror reflec- 
tion lm(0). This reflects the uncertainty in z^. For 
each image Im, we attach a weight Zm to I^, and a 
weight 1 — Zm to lm(0). Intuitively, a fraction of the 
horse in 1^ is at the same orientation as the cur- 
rent template, and a fraction of it is at the opposite 
orientation — a "Schrodinger horse" so to speak. We 
use (Jfc, Wfc, /c = 1, . . . , 2M) to represent these 2M 
images and their weights. 

(2) In Step 1 of the shared sketch algorithm, we 
select (xi,ai) by 



2M 



argmaxN Wk max h{\Rk{x + Ax, 
(Aa;,Aa)eA(Q) 



k=l 



(3) The maximum likelihood estimating equation 
for A,- is 



2M 
M ^ 



, wi. max /i(|i?i,(xj + Ax, 

A;=l 

ai + Aa)p), 

where the right-hand side is the weighted average 
obtained from the 2M training images. 

(4) Along with the selection of Bx^^s,ai and the 
estimation of Aj, we should calculate the template 
matching scores 

logp(Jfe|Bfc, A) 

n 

= Aj max h(\Rk(xi + Ax, 



ai + Aa)\^) 

-logZ(Ai) 

. , 2M. This gives us logp(Im|Bm, A) and 
Bm,A), which can then be used in the 



for k = 1 
logp(Im 
E-step. 



a + Aa) 



We initialize the algorithm by randomly generat- 
ing Zm ~ Unif[0, 1], and then iterate between the M- 
step and the E-step. We stop the algorithm after a 
few iterations. Then we estimate Zm = l if -Zm, > 1/2 
and Zm = otherwise. 

In Figure 5 the results are obtained after 3 iter- 
ations of the EM algorithm. Initially, the learned 
template is quite symmetric, reflecting the confu- 
sion of the algorithm regarding the directions of the 
horses. Then the EM algorithm begins a process of 
"symmetry breaking" or "polarization." The slight 
asymmetry in the initial template will push the algo- 
rithm toward favoring for each image the direction 
that is consistent with the majority direction. This 
process quickly leads to all the images aligned to one 
common direction. 

Figure 6 shows another example where a template 
of a pigeon is learned from examples with mixed 
directions. 

We can also learn a common template when the 
objects are at more than two different orientations 
in the training images. The algorithm is essentially 
the same as described above. Figure 7 displays the 
learning of the template of a baseball cap from ex- 
amples where the caps turn to different orientations. 
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Fig. 6. Template learned from, 11 images of pigeons facing different directions. The image size is 150 x 150. The number of 
elements is 50. The number of iterations is 3. 



The E-step involves rotating the images by match- 
ing to the current template, and the M-step learns 
the template from the rotated images. 

3.2 Learning From Nonaligned Images 

When the objects appear at different locations in 
the training images, we need to infer the unknown 
locations while learning the template. Figure 8 dis- 
plays the template of a bike learned from the 7 train- 
ing images where the objects appear at different lo- 
cations and are not aligned. It also displays the de- 
formed templates superposed on the objects in the 
training images. 

In order to incorporate the unknown locations into 
the image generation process, let us assume that 
both the learned template B = {Bx^^s,ai , i = 1, . . . , n) 
and the training images {Im} are centered at origin. 
Then let us assume that the location of the object 
in image Im is x^*"-*, which is assumed to be uni- 
formly distributed within the image lattice of Im- 



Let us define Br 



.Ml 



'-"^) +Xi + /S.XmA,S,OLi + /S.amA ' 



i = 1, . . . , n) to be the deformed template obtained 
by translating the template B from the origin to 
and then deforming it. Then the generative 
model for Im is p(Im|Bm(x(™)), A). 

Just like the example of learning the horse tem- 
plate, we can transfer the transformation of the tem- 
plate to the transformation of the image data, and 



the latter transformation leads to the alignment of 
the images. Let us define Im(2;^™^) to be the image 
obtained by translating the image Im so that the 



center of Im 

(x^*")) is 



-X 



(m) 



Then p(Im|Bm 



A) = p(Im(x(™))|Bm, A). If we know x^^^ for m = 
1, . . . , M, then the images {Im(x^™^)} are all aligned, 
so that we can learn a template from these aligned 
images by the shared sketch algorithm. On the other 
hand, if we know the template, we can use the tem- 
plate to recognize and locate the object in each Im 
by the inference algorithm, that is, Algorithm 4, us- 
ing the sum-max maps, and identify x^"^^ Such con- 
siderations naturally lead to the iterative EM-type 
scheme. 

The complete-data log-likelihood is 



E 

i=l 



M 



.{m)^ 



m=l 



(13) 



M\ogZ{\) 



In the E-step we perform the recognition task by 
calculating 

Pm(:r) =Pr(x(™) =x|B,A) «p(Im(x)|Bm,A), Vx. 
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Fig. 7. Template learned from 15 images of baseball caps facing different orientations. The image size is 100 x 100. The 
number of elements is 40. The number of iterations is 5. 





Fig. 8. The first row shows the sequence of templates learned in the first 3 iterations. The first one is the starting template, 
which IS learned from the first training image. The second row shows the bikes detected by the learned template, where a 
deformed template is superposed on each training image. The size of the template is 225 x 169. The number of elements is 50. 
The number of iterations is 3. 



That is, we scan the template over the whole image 
Im, and at each location x, we evaluate the template 
matching between the image and the translated 
and deformed template 6^(2;). logp(Im(aj)|Bm, A) 
is the SUM2(x) output by the sum-max maps in 
Algorithm 4.1. This gives us prn{x), which is the 
posterior or predictive distribution of the unknown 
location x^"^) within the image lattice of Im- We can 
then use Pm{x) to compute the expectation of the 
complete-data log-likelihood (13) in the E-step. 

Our experience suggests that Pm{x) is always high- 
ly peaked at a particular position. So instead of 
computing the average of (13), we simply impute 
x^™) = aigmscKx Pm{x) ■ 

Then in the M-step, we maximize the complete 
data log- likelihood (13) by the shared sketch 
algorithm, that is, we learn the template B from 
{lrn{x^^^)^ . This step performs supervised learning 
from the aligned images. 

In our current experiment we initialize the algo- 
rithm by learning (B,A) from the first image. In 
learning from this single image, we set 61 = 62 = 0, 
that is, we do not allow the elements (-B^,^ ,5 q,^, i — 
l,...,?i) to perturb. After that, we reset hi and 62 
to their default values, and iterate the recognition 
step and the supervised learning step. 

In addition to the unknown locations, we also al- 
low the uncertainty in scales. In the recognition step, 
for each Im, we search over a number of different 
resolutions of Im- We take lm{x^^^) to be the opti- 
mal resolution that contains the maximum template 
matching score across all the resolutions. 

In Figure 8 the first row displays the templates 
learned over the EM iterations. The first template 
is learned from the first training image. Figures 9-12 
display more examples. 



4. DISCUSSION 

This paper experiments with EM-type algorithms 
for learning active basis models from training images 
where the objects may appear at unknown locations, 
orientations and scales. For more details on imple- 
menting the shared sketch algorithm, the reader is 
referred to [23] and the source code posted on the 
reproducibility page. 

We would like to emphasize two aspects of the 
algorithms that are different from the usual EM al- 
gorithm. The first aspect is that the M-step involves 
the selection of the basis elements, in addition to the 
estimation of the associated parameters. The second 
aspect is that the performance of the algorithms 
can rely heavily on the initializations. In learning 
from nonaligned images, the algorithm is initialized 
by training the active basis model on a single im- 
age. Because of the simplicity of the model, it is 
possible to learn the model from a single image. In 
addition, the learning algorithm seems to converge 
within a few iterations. 

4.1 Limitations 

The active basis model is a simple extension of 
the wavelet representation. It is still very limited in 
the following aspects. The model cannot account for 
large deformations, articulate shapes, big changes in 
poses and view points, and occlusions. The current 
form of the model does not describe textures and 
lighting variations either. The current version of the 
learning algorithm only deal with situations where 
there is one object in each image. Also, we have 
tuned two parameters in our implementation. One is 
the image resize factor that we apply to the training 
images before the model is learned. Of course, for 
each experiment, a single resize factor is applied to 
all the training images. The other parameter is the 
number of elements in the active basis. 
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Fig. 9. The first row shows the sequence of templates learned m iterations 0, 1, 3, 5. The second and third rows show the 
camel images with superposed deformed templates. The size of the template is 192 x 145. The number of elements is 60. The 
number of iterations is 5. 
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Fig. 10. The first row shows the sequence of templates learned in iterations 0, 1, 3, 5. The other rows show the crane images 
with superposed deformed templates. The size of the template is 285 x 190. The number of elements is 50. The number of 
iterations is 5. 
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Fig. 11. The first row shows the sequence of templates learned in iterations 0, 2, 4, 6, 8, 10. The other rows show the horse 
images with superposed deformed templates. We use the first 20 images of the Weizmann horse data set [3], which are resized 
to half the original sizes. The size of the template is 158 x 116. The number of elements is 60. The number of iterations is 
10. The detection results on the rest of the images in this data set can be found in the reproducibility page. 





4.2 Possible Extensions 

It is possible to extend the active basis model to 
address some of the above limitations. We shall dis- 
cuss two directions of extensions. One is to use active 
basis models as parts of the objects. The other is to 
train active basis models by local learning. 

Active basis models as part-templates: The active 
basis model is a composition of a number of Ga- 
bor wavelet elements. We can further compose mul- 
tiple active basis models to represent more articu- 
late shapes or to account for large deformations by 
allowing these active basis models to change their 
overall locations, scales and orientations within lim- 
ited ranges. These active basis models serve as part- 
templates of the whole composite template. This 
is essentially a hierarchical recursive compositional 
structure [11, 25]. The inference or template match- 
ing can be based on a recursive structure of sum-max 
maps. Learning such a structure should be possi- 
ble by extending the learning algorithms studied in 
this article. See [23] for preliminary results. See also 
[18, 20] for recent work on part-based models. 

Local learning of multiple prototype templates: In 
each experiment we assume that all the training im- 
ages share a common template. In reality, the train- 
ing images may contain different types of objects, 
or different poses of the same type of objects. It 
is therefore necessary to learn multiple prototype 
templates. It is possible to do so by modifying the 
current learning algorithm. After initializing the al- 
gorithm by single image training, in the M-step, we 
can relearn the template only from the K images 
with the highest template matching scores, that is, 
we relearn the template from the K nearest neigh- 
bors of the current template. Such a scheme is con- 
sistent with the EM-clustering algorithm for fitting 



mixture models. We can start the algorithm from 
every training image, so that we learn a local proto- 
type template around each training image. Then we 
can trim and merge these prototypes. See [23] for 
preliminary results. 

REPRODUCIBILITY 

All the experimental results reported in this paper 
can be reproduced by the code that we have posted 
at http : / /www . stat . ucla . edu/~ywu/ActiveBasis 
html. 

ACKNOWLEDGMENTS 

We are grateful to the editor of the special issue 
and the two reviewers for their valuable comments 
that have helped us improve the presentation of the 
paper. The work reported in this paper has been 
supported by NSF Grants DMS-07-07055, DMS-10- 
07889, IIS-07-13652, ONR N00014-05-01-0543, Air 
Force Grant FA 9550-08-1-0489 and the Keck foun- 
dation. 

REFERENCES 

[1] Amit, Y. and Trouve, A. (2007). POP: Patchwork of 
parts models for object recognition. Int. J. Comput. 
Vision 75 267-282. 

[2] Bell, A. and Sejnowski, T. J. (1997). The 'indepen- 
dent components' of natural scenes are edge filters. 
Vision Research 37 3327-3338. 

[3] BoRENSTEiN, E., Sharon, E. and Ullman, S. (2004). 

Combining top-down and bottom-up segmentation. 
In IEEE CVPR Workshop on Perceptual Organiza- 
tion in Computer Vision, Washington, DC 4 46. 

[4] Candes, E. J. and Donoho, D. L. (1999). Curvelets — a 
surprisingly effective nonadaptive representation for 



LEARNING ACTIVE BASIS MODELS BY EM-TYPE ALGORITHMS 



19 



objects with edges. In Curves and Surfaces (L. L. 
Schumaker et al., eds.) 105-120. Vanderbilt Univ. 
Press, Nashville, TN. 

[5] Chen, S., Donoho, D. and Saunders, M. A. (1999). 

Atomic decomposition by basis pursuit. SIAM J. 
Sci. Comput. 20 33-61. MR1639094 

[6] Daugman, J. (1985). Uncertainty relation for resolution 
in space, spatial frequency, and orientation opti- 
mized by two-dimensional visual cortical filters. J. 
Opt. Soc. Amer. 2 1160-1169. 

[7] Dempster, A. P., Laird, N. M. and Rubin, D. B. 

(1977). Maximum likelihood from incomplete data 
via the EM algorithm. J. Roy. Statist, Soc. Ser. B 
39 1-38. MR0501537 

[8] Donoho, D. L., Vetterli, M., DeVore, R. A. and 
Daubechie, I. (1998). Data compression and har- 
monic analysis. IEEE Trans. Inform. Theory 6 
2435-2476. MR1658775 

[9] Fergus, R., Perona, P. and Zisserman, A. (2003). 

Object class recognition by unsupervised scale- 
invariant learning. In Proceedings of Computer Vi- 
sion and Pattern Recognition, Madison, WI 2 264- 
271. 

[10] Friedman, J. H. (1987). Exploratory projection pursuit. 
J. Amer. Statist. Assoc. 82 249-266. MR0883353 

[11] Geman, S., Potter, D. F. and Cm, Z. (2002). Compo- 
sition systems. Quartarly Appl. Math. 60 707-736. 
MR1939008 

[12] Little, R. J. A. and Rubin, D. B. (1983). On jointly 
estimating parameters and missing data by maxi- 
mizing the complete data likelihood. Amer. Statist. 
37 218-220. 

[13] Mallat, S. and Zhang, Z. (1993). Matching pursuit in a 
time-frequency dictionary. IEEE Trans. Signal Pro- 
cess. 41 3397-3415. 

[14] Meng, X.-L. and van Dyk, D. (1997). The EM 
algorithm — an old folk-song sung to a fast new tune. 
J. Roy. Statist. Soc. Ser. B 59 511-567. MR1452025 



[15] Olshausen, B. a. and Field, D. J. (1996). Emergence 
of simple-cell receptive field properties by learning a 
sparse code for natural images. Nature 381 607-609. 

[16] Rabiner, L. R. (1989). A tutorial on hidden Markov 
models and selected applications in speech recogni- 
tion. Proc. IEEE 77 257-286. 

[17] Riesenhuber, M. and POGGIO, T. (1999). Hierarchical 
models of object recognition in cortex. Nature Neu- 
roscience 2 1019-1025. 

[18] Ross, D. A. and Zemel, R. S. (2006). Learning parts- 
based representations of data. J. Mach. Learn. Res. 
7 2369-2397. MR2274443 

[19] Simoncelli, E. p.. Freeman, W. T., Adelson, E. H. 

and Heeger, D. J. (1992). Shiftable multiscale 
transforms. IEEE Trans. Inform. Theory 38 587- 
607. MR1162216 

[20] SUDDERTH, E. B., TORRALBA, A., FREEMAN, W. T. and 

WiLLSKY, A. S. (2005). Learning hierarchical mod- 
els of scenes, objects, and parts. Proc. Int. Conf. 
Comput. Vision 2 1331-1338. 

[21] TiBSHlRANi, R. (1996). Regression shrinkage and selec- 
tion via the lasso. J. Roy. Statist. Soc. Ser. B 58 
267-288. MR1379242 

[22] Wu, Y. N., Si, Z., Fleming, C. and Zhu, S. C. (2007). 

Deformable template as active basis. Proc. Int. 
Conf. Comput. Vision, Rio de Janeiro, Brazil 1-8. 

[23] Wu, Y. N., Si, Z., Gong, H. and Zhu, S. C. 

(2009). Learning active basis model for object de- 
tection and recognition. Int. J. Comput. Vision 
DOI: 10. 1007/sl 1263-009-0287-0 . 

[24] YuiLLE, A. L., Hallinan, P. W. and Cohen, D. S. 

(1992). Feature extraction from faces using de- 
formable templates. Int. J. Comput. Vision 8 99- 
111. 

[25] Zhu, S. C. and Mumford, D. B. (2006). A stochas- 
tic grammar of images. Foundations and Trends in 
Computer Craphics and Vision 2 259-362. 



